Abstract

Using simulated collider data for p + p → 2 Jets interactions in a 2-barrel
pixel detector, a neural network is trained to construct the coordinate of the 
primary vertex to a high degree of accuracy. Three other estimates of this 
coordinate are also considered and compared to that of the neural network. It 
is shown that the network can match the best of the traditional estimates. 

1 Introduction



Artificial Neural Networks (ANNs) are increasingly gaining attention within the high 
energy physics community [1] . The interest emanates from a variety of features com- 
mon to all ANNs; in short, they are fast and robust (see [2] and references therein). 
The former property has justified the use of ANNs as triggers, and the latter has 
made them useful in off-line analyses as well. Most applications have been concerned
with the general problem of discriminating an object of one type from an
object of another type, e.g., a quark jet from a gluon jet or W/Z decays from the
QCD background. There are relatively few applications estimating
analog quantities, such as the invariant mass of hadronic jets [3], the slope of a track,
or the coordinate of a primary vertex in a drift chamber [4]. The accuracies of the
measurements have typically been low (σ ~ millimeters) due to the low resolution
of the detectors involved. In this article we probe the limits of accuracy with which
an ANN can estimate the z-coordinate (position along the beam line) of the primary 
vertex of a pp collision in a pixel detector, where spatial resolution is of the order of 
tens of microns. We also compare our results with some traditional methods, exhibit- 
ing the high accuracy of the neural network's estimates. The method is not restricted 
to use in pixel detectors, but can be applied to any detector with a barrel geometry. 



2 Neural Networks 



There exists a plethora of statistical techniques for analyzing data, and in combination 
with the variations emerging due to specialized needs, one is faced with the difficulty 
of choosing the "best" method of analysis. At the same time, traditional methods 
invariably have inherent assumptions that limit their applicability. For instance, 
distributions are commonly assumed to be gaussian - an assumption that may easily 
be violated. Neural Networks, on the other hand, are free of assumptions regarding 
the a-priori distribution of the data.^1 They are designed to extract existing patterns
from noisy data. The procedure involves training a network with a large sample of 
representative data, after which one exposes the network to data not included in the 
training set with the aim of predicting the new outcomes. Specifically, a feed-forward 
ANN has some number of input and output nodes, characterizing respectively the 
independent and dependent variables of an underlying map which is to be learned 
by the network. There may also be one or more hidden layers with some number
of nodes on each. The value $\sigma_j$ of the j-th node (on the first hidden layer) is given
by $\sigma_j = f\left(\sum_i \omega_{ji}\sigma_i + \theta_j\right)$, where $\omega_{ji}$ are the weights connecting the i-th input node
(whose value is $\sigma_i$) to the j-th hidden node, whose "activation" threshold is $\theta_j$. A
similar rule applies to the nodes on the second hidden layer, as well as those on the
output layer, with the values of the nodes on any layer being determined from the
ones on the previous layer. Typically, the activation function f for the first layer is a
sigmoid function, and for any remaining layer is either a sigmoid or a linear function.

^1 It would be premature, however, to conclude that ANNs are a panacea; see [5] for some examples,
in the context of discrimination, where traditional methods outperform neural networks.

Training an ANN involves minimizing the sum of the squared differences between the
outputs of the network and the targeted values; the initial randomly assigned weights
are modified according to some learning rule in order to minimize this quantity (called
the energy function). Subsequently, the input nodes are assigned values (not in the
training set), and the values of the output nodes, as determined from the trained
weights leading to them, are taken to be the predicted values of the dependent variables
in question. The performance of the network is monitored by a validation set, which
is another set of known independent/dependent variables, not used in the training,
and whose target values are compared to the values predicted by the trained network;
the comparison is done in a variety of ways, one of which will be discussed below.

The particular ANN program used in this study was a modified version of one 
obtained from [6]. The original source codes, written in C++, were designed to be 
compiled and executed on DOS machines. For our purposes the source codes were 
modified to run in the UNIX environment. The modified version allows us to use a 
large number of nodes on each layer as well as to view the network's weights. Usually
the weights are uninterpretable due to the presence of hidden layers and the
nonlinearity of the activation function; however, in the present study the particular
architecture of the network does allow for an unambiguous interpretation. The
activation function for all the layers is taken to be the logistic (or Fermi) function,
$f(x) = 1/(1 + \exp(-x))$.
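
The node rule and the logistic activation described above can be illustrated with a minimal sketch. This is not the C++ program used in the study; the function names and the toy numbers are assumptions chosen only for illustration.

```python
import math

def logistic(x):
    """Logistic (Fermi) activation f(x) = 1 / (1 + exp(-x))."""
    return 1.0 / (1.0 + math.exp(-x))

def layer_values(inputs, weights, thresholds, activation=logistic):
    """Return the node values of one layer.

    inputs     -- values of the nodes on the previous layer
    weights    -- weights[j][i] connects input node i to node j
    thresholds -- thresholds[j] is the activation threshold of node j
    """
    values = []
    for w_j, theta_j in zip(weights, thresholds):
        total = sum(w * s for w, s in zip(w_j, inputs)) + theta_j
        values.append(activation(total))
    return values

# Hypothetical 3-input, 2-node layer (all numbers are illustrative).
sigma_in = [0.2, 0.5, 0.8]
omega = [[0.1, -0.4, 0.3], [0.0, 0.0, 0.0]]
theta = [0.05, 0.0]
hidden = layer_values(sigma_in, omega, theta)
```

A node with all-zero weights and zero threshold returns f(0) = 0.5, the midpoint of the logistic range. Stacking calls to `layer_values` gives the values on later layers.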

The energy surface whose minimum is to be found is well-known for being in- 
fested with local minima. The particular ANN program used here offers two methods 
for eluding the local minima - Simulated Annealing and a Genetic Algorithm. For 
this study we employed the former, both for initiating a set of weights that could 
then be evolved according to the learning rule, and for attempting to escape the local 
minimum when the learning rule was incapable of doing so. The particular learning 
rule adopted here was the conjugate gradient method, a variation of the more familiar 
back-propagation method [6]. 
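
To make the idea of minimizing the energy function concrete, the following sketch trains a single logistic output node with no hidden layer. It uses plain gradient descent as a stand-in: it is not the conjugate gradient method, and it includes no simulated annealing. The toy data, learning rate, and number of epochs are assumptions.

```python
import math
import random

def logistic(x):
    return 1.0 / (1.0 + math.exp(-x))

def predict(weights, threshold, inputs):
    return logistic(sum(w * x for w, x in zip(weights, inputs)) + threshold)

def energy(weights, threshold, samples):
    """Sum of squared differences between network outputs and targets."""
    return sum((predict(weights, threshold, x) - t) ** 2 for x, t in samples)

def train(samples, n_inputs, rate=0.5, epochs=2000, seed=0):
    """Plain gradient descent on the energy (a stand-in learning rule)."""
    rng = random.Random(seed)
    weights = [rng.uniform(-0.1, 0.1) for _ in range(n_inputs)]
    threshold = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * n_inputs
        grad_t = 0.0
        for x, t in samples:
            o = predict(weights, threshold, x)
            # Derivative of the squared error through the logistic output.
            delta = 2.0 * (o - t) * o * (1.0 - o)
            for i, xi in enumerate(x):
                grad_w[i] += delta * xi
            grad_t += delta
        weights = [w - rate * g / len(samples) for w, g in zip(weights, grad_w)]
        threshold -= rate * grad_t / len(samples)
    return weights, threshold

# Toy data: the target is the logistic of the mean of two inputs.
data_rng = random.Random(1)
data = []
for _ in range(50):
    x = [data_rng.uniform(-1, 1), data_rng.uniform(-1, 1)]
    data.append((x, logistic(0.5 * (x[0] + x[1]))))

w, th = train(data, n_inputs=2)
```

Because the toy target is exactly representable by this one-node network, the energy after training falls far below its value for an untrained network with zero weights.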

In training an ANN some transformation of the data is inevitable. For instance, 
given the range of the fermi function, one must scale the target values to lie in the 
range 0 to +1 (for numerical reasons, it is advisable to shrink that range to 0.1 to
0.9). Because of the asymptotic behavior of the fermi function for both small and 
large values of the domain, it is beneficial to scale the independent variables to lie in 
a similar range as well. Later, we will discuss one additional transformation of the 
data. 



3 Vertex Detectors 



For the application at hand, we trained a network using data simulating the reaction
p + p → 2 Jets + X as would be detected by the SDC hybrid pixel detector [7], which was being
designed for use at the SSC until the project was discontinued. The geometry of
the detector is that of eight concentric barrels surrounding the beam pipe. Only the
inner two are covered with pixel wafers, and we find the data from these barrels
sufficient to obtain the desired resolution of the primary vertex's position ($z_0$) along
the colliding beam path. The radii of the two barrels are $r_1$ = 6 cm and $r_2$ = 8 cm;
however, due to an overlapping of adjacent pixel wafers (approximately 1 cm² in size)
attached to these barrels, the radial coordinates of a particle (when it hits the pixels)
differ slightly from these values. Each wafer contains 12 columns by 64 rows of 50 μm
by 300 μm pixels and the length of the barrels is approximately 34 cm (z = -17 cm
to z = +17 cm).

The simulation program SDCSIM, modified to include the pixel detector for
both simulation and reconstruction, provided a pair of coordinates, ($z_1$, $r_1$, $\phi_1$) and
($z_2$, $r_2$, $\phi_2$), for each particle as it penetrated the two barrels. (These hits were given by
the generator rather than by reconstruction; however, we know from other simulation
work that hits on the 2 barrels can easily be correlated via a $\Delta\phi$ constraint to identify a
given particle.) The true coordinate of the primary vertex, $z_0$, for each event was also
provided by the simulation. Because the simulated SSC proton beams were narrow
(5 μm) and since we were interested only in an accurate estimate of the position of
the primary vertex, the angular coordinates were ignored. There are, on average, 100
particles produced in high $p_T$ events at SSC (20 TeV on 20 TeV) energies and each
could be used to estimate $z_0$ by extrapolating a straight line in the z-r plane through
the two barrel hits, back to the beam axis (r = 0). For the i-th particle the estimate
is

$$ z(i) = \frac{z_1(i)\, r_2(i) - z_2(i)\, r_1(i)}{r_2(i) - r_1(i)} . \qquad (1) $$
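
The straight-line extrapolation to the beam axis can be written as a short function. This is a sketch of the formula only; the function name and the example track are hypothetical.

```python
def extrapolate_to_beam_axis(z1, r1, z2, r2):
    """Extrapolate the line through two (z, r) hits back to r = 0.

    Implements z = (z1*r2 - z2*r1) / (r2 - r1). The result is in the
    same length unit as the hit coordinates (cm in the text).
    """
    if r2 == r1:
        raise ValueError("the two hits must lie at different radii")
    return (z1 * r2 - z2 * r1) / (r2 - r1)

# Hypothetical track from a vertex at z = 1.0 cm with slope dz/dr = 0.5,
# crossing the barrels at the nominal radii of 6 cm and 8 cm.
z_est = extrapolate_to_beam_axis(z1=4.0, r1=6.0, z2=5.0, r2=8.0)
```

For this example the hits at (4.0, 6.0) and (5.0, 8.0) extrapolate back to z = 1.0 cm, the vertex position used to construct them.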



^2 Phil Gutierrez and Hong Wang of the OUHEP group provided this data for us.

The goal was to estimate the true primary vertex coordinate $z_0$ from the distribution
of z(i) values. In addition to the neural network, we considered 3 other
estimates - the mean, the point where the histogram maximum occurs, and the median
of the distribution. Figure 1 shows the z(i) distribution for a typical event, along
with 3 estimates of the primary vertex. Due to outliers the mean does not give a very
good estimate of $z_0$. In these examples the point where the peak of the distribution
occurs gives a much better estimate of $z_0$. However, this result obviously depends
on the histogram's bin size and can only be used for events that produce a unique
maximum. The median is a more robust estimate, in that it is well-defined and it is
insensitive to the presence of outliers. A quantitative measure of exactly how good
these estimates are, along with that of the neural network, will be presented below.
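
The three traditional estimates can be sketched as follows. The binning convention, the handling of a non-unique maximum (returning `None`), and the example values are assumptions made for illustration.

```python
import math
import statistics
from collections import Counter

def histogram_peak(values, bin_size):
    """Centre of the most populated bin, or None if the maximum is not unique."""
    counts = Counter(math.floor(v / bin_size) for v in values)
    ranked = counts.most_common()
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return None  # several bins share the maximum: estimate is ambiguous
    return (ranked[0][0] + 0.5) * bin_size

def traditional_estimates(z_values, bin_size):
    """Mean, histogram peak, and median of the per-track z(i) values."""
    return {
        "mean": statistics.mean(z_values),
        "peak": histogram_peak(z_values, bin_size),
        "median": statistics.median(z_values),
    }

# Hypothetical z(i) values in cm: a cluster near 1.05 plus one outlier.
z_tracks = [1.02, 1.03, 1.04, 1.05, 1.05, 1.06, 1.07, 9.0]
est = traditional_estimates(z_tracks, bin_size=0.1)
```

In this toy event the single outlier pulls the mean to about 2.04 cm, while the median and the histogram peak both stay at 1.05 cm.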



4 Neural Networks in Vertex Detectors 

The neural network ultimately arrived at in this work was constructed in two phases. 
In the first phase, a training set of 300 simulated high $p_T$ pp events was used to
find the optimal configuration (including the number of input nodes) of the network.
The second phase adopted this configuration, but employed 800 simulated high $p_T$ pp
events as the training set, and it is this set that was used for the remainder of the
analysis.

The $z_0$ coordinates of the primary vertices were scaled to lie in the range 0.1
to 0.9 and the single output node of the network was assigned this quantity
($\tilde{z}_0 = 0.4 z_0/17 + 0.5$). Since the number of outgoing particle tracks per collision is variable,
we randomly chose a fixed number (55) from each event and assigned the corresponding
scaled z(i) (i.e., $\tilde{z}(i)$) values to the 55 input nodes. The number 55 was chosen
in the first phase to maximize the performance of the network (see Figure 2). In the
first phase 282 events had 55 or more tracks and could be used, whereas in the second 
phase, 766 could be used. 
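
The scaling and the random selection of 55 tracks can be sketched as below. The helper names and the simulated event are hypothetical; only the scaling 0.4 z/17 + 0.5 and the number 55 come from the text.

```python
import random

N_INPUTS = 55          # number of input nodes quoted in the text
HALF_LENGTH_CM = 17.0  # the barrels span z = -17 cm to +17 cm

def scale(z_cm):
    """Map z in [-17, 17] cm onto [0.1, 0.9] via 0.4*z/17 + 0.5."""
    return 0.4 * z_cm / HALF_LENGTH_CM + 0.5

def unscale(s):
    """Inverse of scale(): recover z in cm from a scaled value."""
    return (s - 0.5) * HALF_LENGTH_CM / 0.4

def make_inputs(z_tracks, rng):
    """Pick 55 tracks at random and scale them.

    Returns None for events with fewer than 55 tracks, which cannot be used.
    """
    if len(z_tracks) < N_INPUTS:
        return None
    return [scale(z) for z in rng.sample(z_tracks, N_INPUTS)]

rng = random.Random(0)
event = [rng.gauss(2.0, 0.05) for _ in range(80)]  # hypothetical event
inputs = make_inputs(event, rng)
```

Events with fewer than 55 tracks are rejected rather than padded, which mirrors the reduction from 300 to 282 and from 800 to 766 usable training events.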

The same validation set containing 100 events was used in both phases. The 
number of events in this set having 55 or more tracks was 94. 

A network with no hidden layers, when presented with the randomly selected data
(tracks), did not meet the required degree of accuracy. Figures 3a and 3b show the
actual $z_0$ coordinate versus the predicted value, when a trained network is exposed to
the training set of 300 events (3a), as well as to the validation set (3b). For perfect
training and perfect prediction we would expect all points to lie on a straight line 
of slope one (for both plots). It is clear that this network is not performing well at 
all, although partial learning and prediction are clearly present in both plots. Larger
networks with one and two hidden layers and with a variety of nodes on each layer, 
were tested with similar results. The possibility that the poor performance of these 
networks was due to the algorithm being trapped in a (shallow) local minimum was 
ruled out upon testing a host of parameters that control the simulated annealing. 

It is well-known in neural net circles that an appropriate representation of the 
training data is crucial for the proper training, as well as for the predictive uses of 
the network. In this case, the inability of the network to learn and to predict can be 
traced, not to any shortcoming of the network itself, but to the data that it has been 
expected to learn. Since the 55 tracks are picked randomly, and then presented to 
the network as input, a given input node is assigned a random value as the network 
proceeds from one event to the next. The weight connected to a given input node 
is then being updated according to a gradient rule with no unique direction for the 
gradient itself. 

To resolve this problem, we ordered the 55 input values before presenting them 
to the network. This gives each input node a unique "identity" and leads to a unique 
direction for the gradient updating and considerably improves the performance of the 
network. However, if we preprocess the training set in one additional way, not only 
does the network's performance improve, but a simple interpretation of exactly what 
the network is "doing" can be found - an attribute whose absence is often considered 
a disadvantage of neural networks. In particular, we train the network, not simply
using the scaled coordinate $\tilde{z}_0$ of the primary vertex as the output node, but with
a further transformed coordinate $\hat{z}_0 = f(\tilde{z}_0)$, where f is the logistic function given
above. Of course, this choice is motivated by the activation function itself being the
logistic function.^3 To take advantage of the 0-1 range limit of the logistic function,
this time we scale the data (input and output) according to $\tilde{z}_0 = 0.9 z_0/17$.
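
The two preprocessing steps, ordering the inputs and passing the scaled target through the logistic function, can be sketched as follows. The helper names are hypothetical, and the decoding step (inverting the logistic function to return to centimetres) is our addition for illustration.

```python
import math

def logistic(x):
    return 1.0 / (1.0 + math.exp(-x))

def logit(y):
    """Inverse of the logistic function, valid for 0 < y < 1."""
    return math.log(y / (1.0 - y))

def prepare_inputs(z_tracks_cm):
    """Scale by 0.9/17 and sort, so each input node always sees the same rank."""
    return sorted(0.9 * z / 17.0 for z in z_tracks_cm)

def encode_target(z0_cm):
    """Transformed target: the logistic function of the scaled vertex position."""
    return logistic(0.9 * z0_cm / 17.0)

def decode_output(o):
    """Map a network output back to a vertex position in cm."""
    return logit(o) * 17.0 / 0.9

target = encode_target(5.0)
recovered = decode_output(target)
```

Sorting is what gives each input node a fixed "identity": node 1 always receives the smallest z(i) of the event, node 55 the largest.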

Figures 4a, 4b, and 5a, 5b show the performance, on both the training set and 
the validation set, of a network trained with 10 events and 766 events, respectively. 
It is now clear that the network is both learning and predicting to a high degree of 
accuracy. That the points in Figure 5a lie on a straight line of slope one, may suggest 
that the network is overtrained. However, a counting of the free parameters of the 
network shows that no overfitting is occurring; the number of data points (766) is 
larger than the number of free parameters (56). A network with no hidden layers, 55 
input nodes, and 1 output node has 55 weights and 1 threshold. It is the absence of
a hidden layer that makes possible the interpretation of the weights (see below).

^3 This use of the logistic function in both the target value and as an activation function may
seem equivalent to using a linear activation function and using $\tilde{z}_0$ itself as the target value. However,
the fact that networks with sigmoid activation functions have far better convergence properties precludes a
consideration of that alternative.

To gain a quantitative measure of the accuracies involved, we compute 

$$ E = \sqrt{\frac{1}{N}\sum_{n=1}^{N}\left(o_n - t_n\right)^2} , \qquad (2) $$

as an estimate of the error, where $o_n$ and $t_n$ refer to the output of the network and the
target value, respectively, of the n-th event. N is the total number of events in either
the training set or the validation set, depending on which error is being reported;
we shall use the same symbol, E, for both cases. It is the transformed variables $\hat{z}$
which we use to calculate an E', which is then scaled by E = E'/(0.013) to present here.
This scaling gives E the interpretation of a standard deviation of the distribution of
the unscaled z's. For comparison, E is the appropriate measure of error to use for
the other 3 estimates (the mean, the histogram maximum, and the median) as well.
Noting that E is the standard deviation of a distribution of the residuals ($o_n - t_n$),
we plot a histogram of the residuals for both the training set and the validation set
(Figures 6a and 6b). Figure 7 shows the error E, in microns, as a function of the
number of events in the validation set. Evidently, the median of the distribution is
the best of the three, reaching E ≈ 70 microns.
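
The error measure and the scale factor 0.013 can be illustrated numerically. Two assumptions are made here and are not stated in the text: that the error is the root-mean-square residual, and that 0.013 is the slope of the transformed coordinate f(0.9 z/17) with respect to z, evaluated at z = 0 and expressed per centimetre.

```python
import math

def logistic(x):
    return 1.0 / (1.0 + math.exp(-x))

def rms_error(outputs, targets):
    """Root-mean-square residual (assumed form of the error measure)."""
    n = len(outputs)
    return math.sqrt(sum((o - t) ** 2 for o, t in zip(outputs, targets)) / n)

def transformed_slope(z0_cm=0.0):
    """Derivative of f(0.9*z0/17) with respect to z0, in units of 1/cm."""
    y = logistic(0.9 * z0_cm / 17.0)
    return y * (1.0 - y) * 0.9 / 17.0

def to_microns(e_transformed, slope=None):
    """Convert an error on the transformed scale to microns (1 cm = 1e4 microns)."""
    if slope is None:
        slope = transformed_slope(0.0)
    return e_transformed / slope * 1.0e4

slope_at_origin = transformed_slope(0.0)  # 0.25 * 0.9 / 17, about 0.0132
```

Under these assumptions the slope at the origin is 0.25 × 0.9/17 ≈ 0.0132 per cm, consistent with the quoted factor of 0.013. The slope changes only slightly over the length of the barrel, so a single linear factor is a reasonable approximation.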

Figure 8 plots E of the validation set (94 events) as a function of the number 
of events in the training set. It is evident that the network is predicting the target 
value with increasing accuracy, as the training set is enlarged. This general behavior
is typical of all ANNs, and it can be shown [8] that the error falls off as ~ 1/N, approaching
a limit whose value depends on the architecture of the network. Upon training with
766 events, E of the validation set is at 89 microns. At this point the network has 
outperformed the mean and the maximum of the distribution, and the median is the 
only reasonable contender. 

That the mean is not a good estimate can be explained by the presence of outliers
in the distribution (not shown in Figure 5). The median, on the other
hand, is insensitive to the presence of outliers, hence its superior performance. The
maximum is also insensitive to outliers, but it suffers from ambiguities arising from
the bin size, such as multiple maxima. We also tested an average over 5 bins around 
a maximum, and obtained only a minor improvement. 

Due to the size of the pixels, there is an error in E that can be calculated from 

$$ \sigma_b = \frac{150\,\mu\mathrm{m}}{\sqrt{12}} \times \frac{\sqrt{1 + (r_1/r_2)^2}}{1 - (r_1/r_2)} \times \frac{1}{\sqrt{55}} \times \frac{1}{\sqrt{N}} . $$

The origins of the terms in this equation are as follows: the first is due to the length
of the pixel in the z direction, the second is from the propagation of errors in equation
(1), the third term is due to 55 measurements per event, and the last term comes
from the propagation of errors through equation (2). With $r_1/r_2$ = 6/8 we obtain
$\sigma_b$ = 3 microns. This suggests that the intrinsic resolution of the detector is too small to make
the estimates for $z_0$ from the network and the other 3 measures overlap.
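
The quoted value of 3 microns can be checked numerically. The text does not state which N was used; the sketch below assumes N = 94, the number of usable validation events, which reproduces the quoted figure.

```python
import math

def pixel_error_um(pixel_length_um, r1, r2, n_tracks, n_events):
    """Error in E due to the finite pixel size, in microns.

    Product of four factors: the uniform-distribution error of one hit,
    the error propagation through the two-point extrapolation, and the
    averaging over tracks per event and over events.
    """
    ratio = r1 / r2
    per_hit = pixel_length_um / math.sqrt(12.0)
    lever_arm = math.sqrt(1.0 + ratio ** 2) / (1.0 - ratio)
    return per_hit * lever_arm / math.sqrt(n_tracks) / math.sqrt(n_events)

# N = 94 is an assumption; the other inputs are the values quoted in the text.
sigma_b = pixel_error_um(150.0, r1=6.0, r2=8.0, n_tracks=55, n_events=94)
```

With these inputs the lever-arm factor is exactly 5 (1.25/0.25), and the product comes out at approximately 3.0 microns.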

Given the simple architecture of the network, and the transformations performed 
on the training data, it is possible to decipher what the network is doing. Figure 9a 
shows the 55 weights connecting each of the input nodes to the single output node, 
for a network trained with non-ordered input values. This pattern resembles an 
untrained network, thereby explaining the poor performance of the network before 
the ordering. Figure 9b shows the weights for a network trained with ordered input 
values. Evidently, the network is automatically performing a cut (in z) and averaging 
over the remainder of the distribution. Of course, this is precisely the procedure 
traditionally attempted, though in this case the value of the cut is being optimally 
determined by the network, as it minimizes the output error in the training process. 
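
The "cut and average" interpretation corresponds to a trimmed mean of the ordered inputs. The sketch below is a linear caricature of that idea, not the trained network: it ignores the logistic output and the threshold, and the number of values cut on each side is a free, hypothetical parameter rather than the value found by training.

```python
def trimmed_mean(values, n_cut):
    """Average of the ordered values after dropping n_cut from each end."""
    ordered = sorted(values)
    if 2 * n_cut >= len(ordered):
        raise ValueError("the cut removes every value")
    kept = ordered[n_cut:len(ordered) - n_cut]
    return sum(kept) / len(kept)

def plateau_weights(n_inputs, n_cut):
    """Equivalent weights: zero in the tails, uniform on the central plateau."""
    n_kept = n_inputs - 2 * n_cut
    if n_kept <= 0:
        raise ValueError("the cut removes every input")
    return [0.0] * n_cut + [1.0 / n_kept] * n_kept + [0.0] * n_cut

# Hypothetical ordered inputs with one outlier on each side.
z_values = [-50.0, 0.9, 1.0, 1.1, 40.0]
estimate = trimmed_mean(z_values, n_cut=1)
weights = plateau_weights(len(z_values), n_cut=1)
weighted = sum(w * z for w, z in zip(weights, sorted(z_values)))
```

Applying the plateau weights to the sorted inputs gives the same number as the trimmed mean, which is why a flat central region in the weight plot can be read as a cut followed by an average.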

As a final point (of curiosity), it is interesting that the network tends to under- 
weight the tracks immediately adjacent to the central plateau. In order to determine 
if this was a peculiarity of the particular network at hand, we trained networks using a 
varying number of input nodes (i.e. tracks), and found that this behavior is persistent. 
The "reason" for this diffractive behavior is unclear to us, apart from the fact that it 
is necessary to optimize the performance of the network. 

Figure 1: Histogram of z coordinates for a typical event, with the various
estimates of the actual coordinate of the primary vertex for that event.

Figure 3a: The actual vs. the predicted values of z for 282 events
from a network trained with the same 282 events, before ordering the input nodes.

Figure 3b: The actual vs. the predicted values of z for 94 events
from a network trained with 282 events, before ordering the input nodes.

Figure 4b: The actual vs. the predicted values of z for 94 events
from a network trained with 10 events.

Figure 5a: The actual vs. the predicted values of z for 766 events
from a network trained with the same 766 events.

Figure 5b: The actual vs. the predicted values of z for 94 events
from a network trained with 766 events.

Figure 6a: Histogram of 766 residuals of the training set
(Bin size = 15 microns)

Figure 6b: Histogram of 94 residuals of the validation set
(Bin size = 60 microns)

Figure 7: The validation error (in microns) vs. the number of predicted events
for 3 estimates.

Figure 8: The validation error for 94 events, as a function of
the number of events in the training set of the neural network.

Figure 9b: The weights leading out of the 55 nodes,
after the input nodes are ordered.